Take-home Exercise 1: Demographic structures and distribution of Singapore in 2024

Author

Vanessa Riadi

Published

May 4, 2025

Modified

May 4, 2025

1 Overview

A local online media company that publishes daily content on digital platforms is planning to release an article on demographic structures and distribution of Singapore in 2024.

2 Objective

Assuming the role of the graphical editor of the media company, you are tasked to prepare at most three data visualisations for the article.

3 Analytical Toolkit: RStudio

RStudio and Quarto are used as the primary analytical toolkit for this project. The data is processed using appropriate tidyverse family of packages and the data visualisation prepared using ggplot2 and its extensions.

Before we get started, it is important for us to ensure that the required R packages have been installed.

Install pacman package

If you have yet to install pacman, install itby typing below in the Console:

options(repos = c(CRAN = "https://cloud.r-project.org"))
install.packages("pacman")

We then load the following R packages using the pacman::p_load() function:

- tidyverse, a family of modern R packages specially designed to support data science, analysis and communication task including creating static statistical graphs.
- patchwork for combining multiple ggplot2 graphs into one figure.
- plotly, R library for plotting interactive statistical graphs.
- ggrepel: a R package provides geoms for ggplot2 to repel overlapping text labels.
- ggthemes: a R package provides some extra themes, geoms, and scales for ggplot.
- hrbrthemes: a R package provides typography-centric themes and theme components for ggplot2.
- qreport: Provides statistical components, tables, and graphs. - ggiraph: for making ‘ggplot’ graphics interactive.

pacman::p_load(tidyverse, patchwork,
               plotly, ggrepel,
               ggthemes, hrbrthemes, ggiraph, DT, qreport)

4 Data Preparation

Singapore Residents by Planning Area / Subzone, Single Year of Age and Sex, June 2024 dataset shared by Department of Statistics, Singapore(DOS)

4.1 Load the Data

First we load the data.

demographic_data <- read_csv("data/respopagesexfa2024.csv")

4.2 Check the Data

From the first glance, we notice that there are ‘0’ Pop in the dataset. For this exercise, we will be focusing on the top-level visualization per Planning Areas and granular details like Population per Subzone and Floor Area aren’t necessary. We should exclude those rows with zero population values at the Planning Area level. This will help clean up the data and make the visualizations clearer by removing unnecessary zeros. We will do it in Chapter 4.3

Did you know?

There are more R packages that can help you to view or describe data. E.g. Hmisc, psych, qreport package. I will be using qreport here

If you have yet to install qreport, install it by typing below in the Console:

options(repos = c(CRAN = "https://cloud.r-project.org"))
install.packages("qreport")

Here I am using qreportpackage’s dataOverview that I already pre-load earlier.

dataOverview(demographic_data, plot = c("none"),)
demographic_data has 75696 observations (75696 complete) and 7 variables (7 complete)


|Variable |Type       | Distinct|  Info| Symmetry| NAs|Rarest Value            | Frequency of Rarest Value|Mode        | Frequency of Mode|
|:--------|:----------|--------:|-----:|--------:|---:|:-----------------------|-------------------------:|:-----------|-----------------:|
|PA       |Nonnumeric |       55| 0.999|    0.991|   0|Central Water Catchment |                       228|Bukit Merah |              3876|
|SZ       |Nonnumeric |      332| 1.000|    1.000|   0|Admiralty               |                       228|Admiralty   |               228|
|AG       |Discrete   |       19| 0.997|    1.000|   0|0_to_4                  |                      3984|0_to_4      |              3984|
|Sex      |Discrete   |        2| 0.750|    1.000|   0|Females                 |                     37848|Females     |             37848|
|FA       |Discrete   |        6| 0.972|    1.000|   0|<= 60                   |                     12616|<= 60       |             12616|
|Pop      |Continuous |      183| 0.831|    6.953|   0|1260                    |                         1|0           |             41742|
|Time     |Discrete   |        1| 0.000|    1.000|   0|2024                    |                     75696|2024        |             75696|

Let’s also count what’s the total Pop

cntpop <- demographic_data %>%
  summarise(Pop = sum(Pop, na.rm = TRUE)) 
cat(cntpop$Pop)
4187720

Observation

  • The data shows Singapore Residents by Planning Area / Subzone, Single Year of Age and Sex as of June 2024 with total population of 4,187,720.

  • We observe that there are 75,696 rows and 7 columns. No missing values are observed. Refer to the column legend in Appendix A

  • There are a total of seven attributes. 5 of them are categorical data type and the other three are in numerical data type.

    • The categorical attributes are: PA, SZ, AG, Sex, FA.
    • The numerical attributes are: Pop, Time.
  • We can also observe how many distinct values for each Variable. This will help us think what to use for our visualization.

4.3 Data Preparation

demographic_data_clean <- demographic_data %>%
  filter(Pop > 0)

DT::datatable(demographic_data_clean , options = list(
  columnDefs = list(list(className = 'dt-center', targets = 5)),
  pageLength = 5,
  lengthMenu = c(5, 10, 15, 20)))

5 Data Visualisation, Observation, and Insights

5.1 Top 10 Planning Areas (PA) Ranked by Size of Resident Population (Pop)

top10PA <- demographic_data_clean %>%
  group_by(PA) %>%
  summarise(Pop = sum(Pop, na.rm = TRUE)) %>%
  slice_max(order_by = Pop, n = 10)

top10plot <- ggplot(data = top10PA, 
       aes(y = reorder(PA, Pop/1000), x = Pop/1000)) +  # reorder PA by Pop
  geom_col(show.legend = FALSE, fill = "pink4") +
  geom_text(aes(label = (Pop/1000)),
            hjust = -0.2, color = "black", size = 3) +
  ggtitle("Top 10 Planning Areas in 2024\nRanked by Size of Singapore Resident Population",
  subtitle = paste("Total resident population:", 
                 format(round(cntpop$Pop / 1000, 2), big.mark = ","), 
                 "thousand")) +
  labs(
    y = NULL,
    x = "Resident Population\nin thousands (‘000)",
    caption = "Source: singstat.gov.sg"
  ) +
  theme_ipsum(base_family = "Arial",
  plot_title_size = 14,
  subtitle_size = 10,
  caption_size = 8,
  plot_title_face = "bold",
  caption_face = "italic",
  grid = "",
  axis_title_face = "bold",
  axis_title_size = 11) +
  theme(axis.text.x = element_blank(),
        axis.text.y = element_text(size=11, face="bold"),
        axis.title.x = element_text(hjust = 0.5)
        )+
  scale_x_continuous(expand = expansion(mult = c(0, 0.1)))

top10plot
# Total population of the top 10 Planning Area
cntpoptop10 <- top10PA %>%
  summarise(Pop = sum(Pop, na.rm = TRUE)) 
cat(cntpoptop10$Pop)
2358550
# The percentage of the population of the top 10 most populous Planning Area relative to the total population.
perc_pop_top10 <- (cntpoptop10 / cntpop) * 100
cat(perc_pop_top10$Pop)
56.32062

Insights Plot 1

  • Slightly over half (56.3%) of the 4,187.72 thousand (~4.19 million) residents in Singapore stayed in the top 10 planning areas of residence.

  • There were five planning areas with more than 250,000 residents each, namely Tampines, Bedok, Sengkang, Jurong West, and Woodlands.

  • Tampines was the most populated with 284,720 residents.

5.2 Age Distribution

AG_levels <- c(
  "0_to_4", "5_to_9", "10_to_14", "15_to_19", "20_to_24",
  "25_to_29", "30_to_34", "35_to_39", "40_to_44", "45_to_49",
  "50_to_54", "55_to_59", "60_to_64", "65_to_69", "70_to_74",
  "75_to_79", "80_to_84", "85_to_89", "90_and_over"
)

AGsum2 <- demographic_data_clean %>%
  group_by(AG) %>%
  summarise(Pop = sum(Pop, na.rm = TRUE), .groups = "drop") %>%
  mutate(AG = factor(AG, levels = AG_levels)) %>%
  arrange(AG) %>%
  mutate(
    Pop_share = Pop / sum(Pop),
    cum_share = cumsum(Pop_share),
    percentile = round(cum_share * 100, 1)
  )

# Get the median group
median_AGsum2 <- AGsum2 %>%
  filter(cumsum(Pop) >= sum(Pop) / 2) %>%
  slice(1) %>%
  pull(AG)

# Get the third quantile or 75 percentile
q3_AGsum2 <- AGsum2 %>%
  filter(cumsum(Pop) >= sum(Pop) * 0.75) %>%
  slice(1) %>%
  pull(AG)

AGsumplot2 <- AGsum2 %>%
  ggplot(
       aes(y = Pop/1000, x = AG)) +  # reorder AG_recode by Pop
  geom_col(show.legend = FALSE, fill = "pink4") +
  geom_text(aes(label = (Pop/1000)),
            color = "black", size = 3, angle = 90, hjust = -0.2) +
  ggtitle("Age Distribution of Singapore Resident Population",
  subtitle = "in thousands (‘000)") +
  annotate("segment",
         x = median_AGsum2, xend = median_AGsum2,
         y = 0, yend = 360,
         color = "red", 
         linewidth = 0.7, 
         linetype = "dotted") +
  annotate("text",
         x = median_AGsum2, 
         y = 370,
         label = "Median",
         color = "red",
         size = 2.8)+
  annotate("segment",
         x = q3_AGsum2, xend = q3_AGsum2,
         y = 0, yend = 360,
         color = "skyblue", 
         linewidth = 0.7, 
         linetype = "dotted") +
  annotate(
    geom = "text", 
    x = q3_AGsum2, 
    y = 370,
    label = paste0("Q3"),
    color="skyblue",
    size = 2.8)+
  labs(
    y = NULL,
    x = "Age Group",
    caption = "Source: singstat.gov.sg") +
  theme_ipsum(base_family = "Arial",
  plot_title_size = 14,
  subtitle_size = 10,
  caption_size = 8,
  plot_title_face = "bold",
  caption_face = "italic",
  grid = "Y",
  axis_title_face = "bold",
  axis_title_size = 11) +
  theme(axis.text.x = element_text(size=8, face="bold", angle = -45, hjust = 0),
        axis.text.y = element_blank(), #element_text(size=11, face="bold"),
        axis.title.x = element_text(size=11, face="bold", hjust = 0.5)
        )+
  scale_y_continuous(expand = expansion(mult = c(0, 0.02))) + 
  scale_x_discrete(labels=c("0_to_4" = "0-4",
      "5_to_9" = "5-9",
      "10_to_14" = "10-14",
      "15_to_19" = "15-19",
      "20_to_24" = "20-24",
      "25_to_29" = "25-29",
      "30_to_34" = "30-34",
      "35_to_39" = "35-39",
      "40_to_44" = "40-44",
      "45_to_49" = "45-49",
      "50_to_54" = "50-54",
      "55_to_59" = "55-59",
      "60_to_64" = "60-64",
      "65_to_69" = "65-69",
      "70_to_74" = "70-74",
      "75_to_79" = "75-79",
      "80_to_84" = "80-84",
      "85_to_89" = "85-89",
      "90_and_over" = "> 90"))

AGsumplot2
# Details of the AG distribution
AGsum2
# A tibble: 19 × 5
   AG             Pop Pop_share cum_share percentile
   <fct>        <dbl>     <dbl>     <dbl>      <dbl>
 1 0_to_4      170930   0.0408     0.0408        4.1
 2 5_to_9      202420   0.0483     0.0892        8.9
 3 10_to_14    204610   0.0489     0.138        13.8
 4 15_to_19    211560   0.0505     0.189        18.9
 5 20_to_24    225020   0.0537     0.242        24.2
 6 25_to_29    270090   0.0645     0.307        30.7
 7 30_to_34    321010   0.0767     0.383        38.3
 8 35_to_39    315180   0.0753     0.459        45.9
 9 40_to_44    310700   0.0742     0.533        53.3
10 45_to_49    301820   0.0721     0.605        60.5
11 50_to_54    307760   0.0735     0.678        67.8
12 55_to_59    294500   0.0703     0.749        74.9
13 60_to_64    297020   0.0709     0.820        82  
14 65_to_69    266580   0.0637     0.883        88.3
15 70_to_74    206760   0.0494     0.933        93.3
16 75_to_79    134810   0.0322     0.965        96.5
17 80_to_84     77750   0.0186     0.983        98.3
18 85_to_89     44050   0.0105     0.994        99.4
19 90_and_over  25150   0.00601    1           100  

Insights Plot 2

  • The median age falls within 40 to 44 age group with 25% of the population is aged 60 and above and 18% aged 65 and older. This indicates an aging population, a trend characterized by an increasing proportion of older individuals (typically defined as 65 years and over).

  • The youngest age groups (0–4, 5–9, 10–14) collectively account for only about 13.8% of the population, suggesting lower birth rates in recent years.

  • According to the Ministry of Manpower, the working-age population is defined as those aged 15 to 64, who make up approximately 68.2% of the total population. This reflects a strong labour force, though future demographic challenges may arise as this group continues to age.

5.3 Sex distribution by Age Group

pyramid_data <- demographic_data_clean %>%
  group_by(AG, Sex) %>%
  summarise(Population = (sum(Pop, na.rm = TRUE)/1000)) %>%
  ungroup()

totpop_pyramid <- sum(pyramid_data$Population)

pyramid_data <- pyramid_data %>%
  mutate(PopPercentage = ifelse(Sex == "Females",
                                -round(Population / totpop_pyramid * 100, 2),
                                round(Population / totpop_pyramid * 100, 2)),
         Signal = ifelse(Sex == "Females", -1, 1))

pyramid_data$AG <- factor(pyramid_data$AG,
                           levels = c("0_to_4", "5_to_9", "10_to_14", "15_to_19",
                                      "20_to_24", "25_to_29", "30_to_34", "35_to_39",
                                      "40_to_44", "45_to_49", "50_to_54", "55_to_59",
                                      "60_to_64", "65_to_69", "70_to_74", "75_to_79",
                                      "80_to_84", "85_to_89", "90_and_over"),
                           labels = c("0-4", "5-9", "10-14", "15-19",
                                      "20-24", "25-29", "30-34", "35-39",
                                      "40-44", "45-49", "50-54", "55-59",
                                      "60-64", "65-69", "70-74", "75-79",
                                      "80-84", "85-89", ">90"),
                           ordered = TRUE)


pyramid_plot <- ggplot(pyramid_data, aes(x = AG, y = PopPercentage, fill = Sex)) +
  geom_bar(stat = "identity") +
  geom_text(aes(y = PopPercentage + Signal * 0.5, label = abs(PopPercentage)),
            size = 3, color = "black") +
  coord_flip() +
  scale_fill_manual(values = c("Females" = "pink2", "Males" = "steelblue"),
                    guide = guide_legend(override.aes = list(fill = NA))) +
  scale_y_continuous(labels = abs) +
  ggtitle("Population Pyramid of Singapore Residents\nby Age and Sex 2024",
  subtitle = "in percentage (%)") +
  labs(
    y = "Population (%)",
    x = "Age Group",
    fill = "Sex",
    caption = "Source: singstat.gov.sg") +
  theme_ipsum(base_family = "Arial",
  plot_title_size = 14,
  subtitle_size = 10,
  caption_size = 8,
  plot_title_face = "bold",
  caption_face = "italic",
  grid = "Y",
  axis_title_face = "bold",
  axis_title_size = 11,
  axis_text_size = 8) +
  theme(
    strip.text = element_text(face = "bold"),
    axis.title.x = element_text(hjust = 0.5),
    axis.title.y = element_text(hjust = 0.5),
    #axis.text.y = element_text(size = 8),
    legend.position = "top",
    legend.title = element_blank(),
    legend.justification = c(0.45, 0),
    legend.margin = margin(t = -20, r = 0, b = -10, l = 0, unit = "pt"))

pyramid_plot
# Details of the population pyramid distribution
pyramid_data
# A tibble: 38 × 5
   AG    Sex     Population PopPercentage Signal
   <ord> <chr>        <dbl>         <dbl>  <dbl>
 1 0-4   Females       83.4         -1.99     -1
 2 0-4   Males         87.5          2.09      1
 3 10-14 Females      100.          -2.4      -1
 4 10-14 Males        104.           2.49      1
 5 15-19 Females      104.          -2.49     -1
 6 15-19 Males        107.           2.56      1
 7 20-24 Females      110.          -2.64     -1
 8 20-24 Males        115.           2.74      1
 9 25-29 Females      135.          -3.22     -1
10 25-29 Males        135.           3.23      1
# ℹ 28 more rows

Insights Plot 3

  • In the youngest age group (0-4), there are slightly more males than females.

  • The 25-29 age group shows a near-equal population size for both genders.

  • The gender gap widens in the older age cohorts, with majority female in the 80+ age groups. This show that females live longer than males on average, consistent with the life expectancy at birth between the different gendersfrom 2023 report by the Singapore Department of Statistics.

6 Reference

- ggplot for categorical-data
- Describe function
- gt package
- theme for ggplot2
- Recode Values with dplyr
- Customize tick marks and labels
- National Statistical Standards Recommendations on Definition and Classification of Age
- Cencus of Population 2020
- Population Pyramid Plot
- Ageing Population

7 Appendix

7.1 Appendix A

Column Name Description
PA Planning Area
SZ Subzone
AG Age Group
Sex Sex
FA Floor Area of Residence
Pop Resident Count (Population)
Time Time / Period